Machine learning prediction of hematoma expansion in acute intracerebral hemorrhage

To examine whether a machine learning (ML) approach can predict hematoma expansion in acute intracerebral hemorrhage (ICH) with accuracy and widespread applicability, we applied ML algorithms to multicenter clinical data and CT findings on admission. Patients with acute ICH from three hospitals (n = 351) and those from another hospital (n = 71) were retrospectively assigned to the development and validation cohorts, respectively. To develop ML predictive models, the k-nearest neighbors (k-NN) algorithm, logistic regression, support vector machines (SVMs), random forests, and XGBoost were applied to the patient data in the development cohort. The performance of the models was evaluated on the patient data in the validation cohort and compared with that of previous scoring methods: the BAT, BRAIN, and 9-point scores. The k-NN algorithm achieved the highest area under the receiver operating characteristic curve (AUC) of 0.790 among all ML models, with sensitivity, specificity, and accuracy of 0.846, 0.733, and 0.775, respectively. The BRAIN score achieved the highest AUC of 0.676 among all previous scoring methods, which was lower than that of the k-NN algorithm (p = 0.016). We developed and validated ML predictive models of hematoma expansion in acute ICH. The models demonstrated good predictive ability, showing better performance than the previous scoring methods.

The hemorrhage locations were categorized as basal ganglia, thalamus, lobe, brain stem, and cerebellum. The presence of intraventricular extension of hemorrhage was noted. The hematoma volume was calculated with the ABC/2 formula 27 . Hematoma expansion was defined as an increase in volume between baseline and follow-up CT scans exceeding 6 cm³ or 33% of the baseline volume [16][17][18][19][20]28 .
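The volume estimate and the expansion criterion above can be illustrated with a minimal Python sketch. It assumes the usual reading of the ABC/2 formula (A, the greatest hematoma diameter; B, the diameter perpendicular to A; C, the vertical extent, all in cm); the function and parameter names are hypothetical and are not taken from the study code.

```python
def abc2_volume(a_cm, b_cm, c_cm):
    """Approximate hematoma volume in cm^3 with the ABC/2 formula."""
    return a_cm * b_cm * c_cm / 2.0


def is_hematoma_expansion(baseline_cm3, followup_cm3,
                          abs_threshold_cm3=6.0, rel_threshold=0.33):
    """Return True if the volume increase exceeds 6 cm^3 or 33% of baseline."""
    increase = followup_cm3 - baseline_cm3
    return (increase > abs_threshold_cm3
            or increase > rel_threshold * baseline_cm3)


# Illustrative values only (not patient data).
baseline = abc2_volume(4.0, 3.0, 2.0)             # 12.0 cm^3
expanded = is_hematoma_expansion(baseline, 19.0)  # increase of 7 cm^3
```

Either threshold alone is sufficient in this sketch, so a small hematoma can meet the relative criterion without reaching an absolute increase of 6 cm³.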
Intrahematoma hypodensities, irregular hematoma shape, and blend sign were identified as noncontrast CT markers. Intrahematoma hypodensities were defined as the presence of any hypodense region encapsulated within the hematoma having any morphology and size, separated from the surrounding parenchyma 3,4,12,14 . Irregular hematoma shape was defined as the presence of 2 or more hematoma edge irregularities 4,7,9,12 . Blend sign was defined as blending of a relatively hypoattenuating area with an adjacent hyperattenuating region within a hematoma, with a well-defined margin and a difference of at least 18 Hounsfield units between these regions 4,6,8,12 . When available, CT angiography spot sign was evaluated, which was defined as follows: (1) ≥ 1 focus (attenuation ≥ 120 Hounsfield units) of any size and morphology of contrast pooling within a hematoma, and (2) discontinuous from normal or abnormal vasculature adjacent to the hematoma 15,29 . The CT markers were independently evaluated by 2 observers. When the observers disagreed, the CT images were re-evaluated by both observers together until consensus was reached.
Inhospital management. After identification of ICH on baseline CT scan, continuous blood pressure monitoring and blood pressure-lowering treatment were initiated. Calcium channel blockers, mainly intravenous nicardipine, were administered as antihypertensive agents throughout the period between baseline and follow-up CT scans. The target systolic blood pressure was less than 140 mmHg or 180 mmHg.
Statistical analysis. Continuous variables were summarized using a mean with standard deviation or a median with interquartile range and compared using Student's t test or Mann-Whitney U test, depending on the distribution of the variable assessed by the Shapiro-Wilk test. Categorical variables were summarized using a count with percentages and compared using Fisher's exact test.
To confirm the superiority of predictive models using ML over the previous scoring methods, the BAT, BRAIN, and 9-point scores in the validation cohort were calculated [16][17][18][19] . For each score, the receiver operating characteristic (ROC) curve was drawn and the best cutoff value was determined by Youden's index. For each scoring method, accuracy, sensitivity, specificity, and the area under the ROC curve (AUC) for the prediction of hematoma expansion were computed. The AUCs of the three scores and those of the ML models were compared using the DeLong test.
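As an illustration of the cutoff selection and AUC computation described above, the following Python sketch uses only the standard library. It assumes that a patient is predicted positive when the score is greater than or equal to the cutoff, and it computes the AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case (ties counted as one half). The function names and the toy scores are hypothetical; the DeLong test is not implemented here.

```python
def roc_points(scores, labels):
    """Return (cutoff, sensitivity, specificity) for each distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        points.append((cutoff, tp / n_pos, tn / n_neg))
    return points


def youden_cutoff(scores, labels):
    """Pick the cutoff maximizing J = sensitivity + specificity - 1."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)


def auc(scores, labels):
    """Rank-based AUC: fraction of positive/negative pairs ordered correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores and outcomes (1 = expansion); not study data.
toy_scores = [1, 2, 2, 3, 4, 4, 5, 5]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
best_cutoff, best_sens, best_spec = youden_cutoff(toy_scores, toy_labels)
toy_auc = auc(toy_scores, toy_labels)
```

When two cutoffs give the same Youden's index, this sketch keeps the lower one; a different tie-breaking rule would be equally defensible.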
All statistical analyses were performed using EZR (Saitama Medical Center, Jichi Medical University, Saitama, Japan) 30 , which is a graphical user interface for R (The R Foundation for Statistical Computing, Vienna, Austria).
Machine learning environment and algorithms. To develop predictive models, supervised ML algorithms were adopted, in which pairs of input data and output class are given to the algorithm, which finds a way to generate the output class from the input data 31 . The k-nearest neighbors (k-NN) algorithm, logistic regression, support vector machines (SVMs), random forests, and XGBoost were selected as the supervised algorithms. The k-NN algorithm is the simplest ML algorithm: it finds the k neighbors closest to a new observation in the stored training data and makes a prediction by assigning the majority class among these neighbors 31 . Logistic regression is a binary classifier in which a linear model is passed through a logistic function to compute the probability that a new observation belongs to each class 31 . SVMs find the hyperplane that maximizes the margin between classes in the training data and make a prediction based on the distances to the support vectors and the importance of each support vector 31 . Random forests train many decision trees, where each tree receives only a bootstrapped sample of the training data and each node considers only a subset of features when determining the best split; the prediction follows the averaged probabilities predicted by all the trees 31 . XGBoost is a gradient boosting algorithm that builds decision trees serially, with each tree trying to correct the mistakes of the previous one; the probability is computed by summing the weights of the leaves to which a new observation belongs in each decision tree 31 .
With each supervised algorithm, predictive model development using the patient data of the development cohort (training data set) and external validation using that of the validation cohort (test data set) were planned.
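The k-NN prediction rule described in this section can be sketched in a few lines of Python. This is a minimal illustration that assumes Euclidean distance between feature vectors; it is not the implementation used in the study, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x_new, k=5):
    """Assign the majority class among the k training points nearest to x_new."""
    # Sort training observations by Euclidean distance to the new observation.
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], x_new))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature data set (1 = expansion); not study data.
toy_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_y = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(toy_X, toy_y, (5.5, 5.5), k=3)
```

Because the prediction depends directly on distances, features measured on very different scales can dominate the vote; whether scaling is applied is a separate design choice.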
Feature selection and scaling, and oversampling. Baseline clinical variables, CT findings including hemorrhage locations, intraventricular hematoma extension, baseline hematoma volume, and noncontrast CT markers, and target systolic blood pressure were applied as the input data, while hematoma expansion was applied as the output class.
Since there were 31 individual properties of the input data, which were called features, feature selection was performed to lead to simpler models that generalize better 31 . Firstly, univariate analyses with Student's t test, Mann-Whitney U test, and Fisher's exact test were performed between expansion and no expansion groups in the training data set. Secondly, the features were ranked in accordance with their P values. Finally, 5 to 10 features with the smallest P values were selected. Feature scaling by standardization was performed for SVMs, which require all the features to vary on a similar scale to perform well.
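The ranking step and the standardization step can be sketched as follows. The P values below are hypothetical placeholders, not those of the study, and the feature names are generic. The sketch also assumes that the mean and standard deviation are estimated on the training data and then reused for other data, and that the population standard deviation is used; neither detail is stated in the text.

```python
import statistics


def select_features(p_values, n_features):
    """Return the names of the n_features features with the smallest P values."""
    ranked = sorted(p_values, key=p_values.get)
    return ranked[:n_features]


def fit_standardizer(train_values):
    """Estimate mean and (population) standard deviation on training data."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, sd):
    """Rescale values to zero mean and unit variance using fitted parameters."""
    return [(v - mean) / sd for v in values]


# Hypothetical P values from univariate tests (placeholders only).
toy_p_values = {"feature_a": 0.001, "feature_b": 0.04, "feature_c": 0.60}
selected = select_features(toy_p_values, 2)

toy_mean, toy_sd = fit_standardizer([10.0, 20.0, 30.0])
scaled = standardize([10.0, 20.0, 30.0], toy_mean, toy_sd)
```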
Given the imbalance of the output class distribution, random oversampling was employed. Random oversampling involved randomly selecting observations from the minority group with replacement and adding them to the training data set.
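Random oversampling as described above can be sketched with the Python standard library. The sketch assumes that the minority class is resampled until both classes are the same size; the target class ratio is not stated in the text, so this is an assumption, and the function name is hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority observations until classes balance."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority_count = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Sampling with replacement; k is 0 for the majority class.
        extra = rng.choices(pool, k=majority_count - count)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


# Toy data: 2 minority observations (label 1) and 8 majority observations.
toy_X = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
toy_y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
bal_X, bal_y = random_oversample(toy_X, toy_y)
```

In this sketch the original observations are kept unchanged and only duplicates of minority observations are appended, so oversampling is meant to be applied to training data only.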

Predictive model development and external validation. Each supervised ML algorithm was applied to the training data set with 5 to 10 selected features and with all 31 features. In the predictive model development process, stratified 30-fold cross-validation was used to assess generalization performance, in which the training data set was split such that the proportions between output classes were the same in each fold as they were in the whole training data set 31 . The hyperparameters listed in Table 1 were tuned manually for each algorithm to improve generalization performance, while the hyperparameters not listed in Table 1 were left at their default values.
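The idea of stratified splitting can be illustrated with a short Python sketch that deals the observations of each class across the folds in turn, so that each fold keeps roughly the class proportions of the whole data set. The number of folds is left as a parameter; the sketch does not shuffle the data, which is a simplification, and the function name is hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Return a list of folds (lists of indices) with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of each class across the folds in round-robin order.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy labels: 4 positives and 16 negatives split into 4 folds.
toy_labels = [1] * 4 + [0] * 16
toy_folds = stratified_folds(toy_labels, 4)
```

In each round of cross-validation, one fold serves as the held-out set and the remaining folds are used for fitting.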
After the model development, each model was evaluated for its performance on the test data set as external validation, where accuracy, sensitivity, specificity, and the AUC for the prediction of hematoma expansion were computed. www.nature.com/scientificreports/
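The three threshold-based measures named above follow directly from the counts of true and false predictions. In the sketch below, the counts are back-calculated so that they are consistent with the validation cohort size (71 patients, 26 with expansion) and with the sensitivity, specificity, and accuracy reported for the best model; the counts themselves are an assumption and are not reported in the text.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positives among all expansions
    specificity = tn / (tn + fp)   # true negatives among all non-expansions
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Assumed counts: 22 of 26 expansions and 33 of 45 non-expansions correct.
sens, spec, acc = classification_metrics(tp=22, fn=4, tn=33, fp=12)
rounded = (round(sens, 3), round(spec, 3), round(acc, 3))
```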

Results
After application of the inclusion and exclusion criteria, 351 of 930 patients in the development cohort and 71 of 212 patients in the validation cohort were evaluated (Fig. 1). Hematoma expansion occurred in 71 patients (20.2%) in the development cohort and in 26 patients (36.6%) in the validation cohort (Table 2). On comparison between expansion and no expansion groups in the development cohort, the 10 variables with the smallest P values were, in increasing order of P value, baseline hematoma volume, intrahematoma hypodensities, PT-INR, anticoagulant use, lobar hemorrhage, irregular hematoma shape, platelet count, sex, time from onset to baseline CT scan, and cerebellar hemorrhage (Table 3); these were used as the selected features.
The k-NN algorithm achieved the highest AUC of 0.790 (95% confidence interval [CI], 0.693-0.886) among all ML models, where 9 selected features were used and the hyperparameter n_neighbors was 5 (Table 4). The best cutoff values in the previous scoring methods were 3 in the BAT score, 9 in the BRAIN score, and 4 in the 9-point score. Although the BRAIN score achieved the highest AUC of 0.676 (95% CI, 0.579-0.772) among all previous scoring methods, the k-NN algorithm that achieved the best performance of all ML models showed higher AUC than the BRAIN score (0.790 vs. 0.676; p = 0.016) (Table 4).

Discussion
We developed and validated ML predictive models of hematoma expansion in acute ICH. The models demonstrated good predictive ability, showing better performance than the previous scoring methods. Multicenter data and multivendor CT images were used for model development, so that the models were generalizable and widely applicable.
Thirty-one features, consisting of baseline clinical variables, CT findings, and target systolic blood pressure, were put into the model development process. Clinical variables only contained general patient information and blood test findings. Thus, they could be easily collected in clinical practice. All CT findings were obtained from noncontrast CT scans; and CT scan data included those performed with a thickness of 0.5-10.0 mm. Although the spot sign, which is also included in the 9-point score, is useful for predicting hematoma expansion, CT angiography is available in a limited number of hospitals. Additionally, although noncontrast CT markers are usually evaluated with a thickness of 5.0 mm, in clinics or developing countries, CT scans are not uncommonly performed with a thickness of more than 5 mm. Therefore, in order that predictive models could be used in many hospitals and countries, we acquired and analyzed CT scan data for such conditions. We experimentally included target systolic blood pressure in the features because it could be determined at admission. However, there was no statistical difference regarding target systolic blood pressure between expansion and no expansion groups in the development cohort. Therefore, target systolic blood pressure was not included in the features of the best ML model.
Feature selection was performed to develop simpler ML predictive models. When developing models using many features, or a high-dimensional data set, models become complex and the chance of overfitting increases 31 . There are three basic strategies for selecting features: model-based selection, iterative selection, and univariate analysis 31 . Model-based selection utilizes supervised ML models such as linear models and decision tree-based models to judge the importance of each feature. In iterative selection, a series of models for feature selection are built, where the features with higher importance are selected. These methods consider all features at once and may be able to capture interactions between features. However, when the performance of the models for feature selection is low, selected features could be unreliable. Univariate analysis was the strategy we chose in this study; it ignores correlations between individual features and therefore discards features that are informative only when combined with other features. Although the best ML model performed well with univariate analysis, better feature selection methods may exist. One caveat is that elaborate feature selection may itself lead to overfitting and thereby reduce model performance.
We have made the raw data and the programming code of ML algorithms available on the websites to ensure reproducibility of the developed models: we believe that this is the most important point for the clinical studies using ML. There may be better ML approaches than what we have shown in this study, and better ML algorithms that can achieve higher performance may be created in the future. By using ML approaches, we can easily add the data of other facilities and develop more robust and reliable ML models. With the maturity of ML technology and its usage environment, it is becoming easier for clinicians to learn ML and apply it to clinical research. We hope that our data and algorithms will be widely used and applied to new analyses.
ML approaches have been used in medical research and often perform better than classical statistical models 21,22 . In this study, even though there were some statistical differences in patient characteristics between the development and validation cohorts (Table 2), the developed ML models showed better predictive ability than the previous scoring methods, such as the BAT, BRAIN, and 9-point scores, in the validation cohort [16][17][18][19] .
Table 2. Characteristics of the development and validation cohorts. Data are presented as n (%), mean ± standard deviation, or median (interquartile range). CT = computed tomography; PT-INR = prothrombin time-international normalized ratio. *Mann-Whitney U test between the development and validation cohorts. † Fisher's exact test between the development and validation cohorts. ‡ Student's t test between the development and validation cohorts. **CT angiography is not performed in all patients.
Several clinical studies have investigated the relationships between lowering of blood pressure and the outcome in patients or hematoma expansion though no conclusion has been reached yet [32][33][34][35] . However, the ultraearly lowering of blood pressure may benefit patients with acute ICH 36 . Moreover, anticoagulant reversal may reduce hematoma expansion 37 . The developed ML models in this study may be useful, especially in ultra-early phase or when anticoagulants are given, for selecting patients who require more careful treatment.
A few limitations should be noted. First, more patients in the development and validation cohorts are needed to achieve more robust and more satisfactory performance of the ML predictive models. It is hard to determine the appropriate number of patients in ML analyses because it depends on the quality of the input data. However, efforts are required to increase the number of patients and to confirm that model performance has reached a plateau, that is, that it no longer improves as the number of patients increases 38 . Second, CT findings were evaluated by humans. If an artificial neural network were used to analyze the CT scan data, hybrid models could be created that unify analyses of imaging data and clinical variables within an ML pipeline. Such hybrid models are likely to achieve higher predictive performance. A serious obstacle, however, is that brain image data usually contain facial information and therefore cannot easily be shared.
In conclusion, we developed widely applicable predictive models of hematoma expansion in acute ICH by applying ML algorithms to clinical data and noncontrast CT findings. The models showed better performance than the previous scoring methods. We have made the raw data and the programming code available on the websites so that anyone can utilize and improve the models.
Table 3. Univariate analyses between expansion and no expansion groups in the development cohort. Data are presented as n (%), mean ± standard deviation, or median (interquartile range). CT = computed tomography; PT-INR = prothrombin time-international normalized ratio. *Mann-Whitney U test between expansion and no expansion groups. † Fisher's exact test between expansion and no expansion groups. ‡ Student's t test between expansion and no expansion groups.